fabric: Add fi_hmem_attr to fi_info #10400
Conversation
FI_HMEM is only an on/off capability bit but there are more specific capabilities. Add fi_hmem_attr to fi_info to provide better control over:
- whether device API calls are permitted in libfabric, except for FI_HMEM initialization.
- whether peer to peer transfers should be used.
- whether optimized memcpy for device memory should be used.

Signed-off-by: Jessie Yang <[email protected]>
Duplicate and free the fi_hmem_attr data if present. Signed-off-by: Jessie Yang <[email protected]>
Print fi_hmem_attr when passing fi_info to fi_tostr. Signed-off-by: Jessie Yang <[email protected]>
Check that FI_HMEM is set in caps when fi_hmem_attr is used, and that the user does not request attributes beyond what the provider can offer. Add altering of fi_hmem_attr. Signed-off-by: Jessie Yang <[email protected]>
Update man pages for fi_hmem_attr. Signed-off-by: Jessie Yang <[email protected]>
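A rough sketch of how an application might fill in the proposed attribute when calling fi_getinfo(); the hmem_attr field and struct fi_hmem_attr are what this PR adds, and the FI_HMEM_ATTR_* value names below are placeholders rather than the exact names in the patch:

```c
#include <stdlib.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

/* Sketch of the proposed usage: request FI_HMEM, forbid device API calls
 * inside libfabric, and require peer-to-peer transfers.  The hmem_attr
 * field is what this PR adds; FI_HMEM_ATTR_* names are placeholders. */
static struct fi_info *get_hmem_info(void)
{
	struct fi_info *hints, *info = NULL;

	hints = fi_allocinfo();
	if (!hints)
		return NULL;

	hints->caps |= FI_HMEM;
	hints->hmem_attr = calloc(1, sizeof(*hints->hmem_attr));
	hints->hmem_attr->iface = FI_HMEM_CUDA;
	hints->hmem_attr->api_permitted = FI_HMEM_ATTR_DISABLED;  /* no CUDA calls in libfabric */
	hints->hmem_attr->use_p2p = FI_HMEM_ATTR_REQUIRED;        /* drop providers without P2P */

	if (fi_getinfo(fi_version(), NULL, NULL, 0, hints, &info))
		info = NULL;
	fi_freeinfo(hints);
	return info;
}
```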
@@ -303,6 +311,7 @@ struct fi_info_1_7 {
struct fi_domain_attr_1_7 *domain_attr;
struct fi_fabric_attr_1_7 *fabric_attr;
struct fid_nic_1_7 *nic;
struct fi_hmem_attr_1_7 *hmem_attr;
Changes to this file need to be dropped. These are definitions for the prior ABI structures.
Ack
enum fi_hmem_attr_opt use_p2p;
enum fi_hmem_attr_opt use_dev_reg_copy;
struct fi_hmem_attr *next;
};
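Pieced together from the fragments quoted in this review, the proposed structure is roughly the following (field order may differ from the actual patch):

```c
/* Proposed structure, consolidated from the diff fragments in this PR;
 * field order may differ from the actual patch. */
struct fi_hmem_attr {
	enum fi_hmem_iface	iface;
	enum fi_hmem_attr_opt	api_permitted;
	enum fi_hmem_attr_opt	use_p2p;
	enum fi_hmem_attr_opt	use_dev_reg_copy;
	struct fi_hmem_attr	*next;
};
```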
I think these are more along the lines of a configuration than an attribute. The app is wanting to configure the provider behavior here, versus discovering provider capabilities.
I want to avoid an embedded linked list. It adds complexity, and it's unlikely that a system will realistically have more than 1 type of hmem installed anyway.
My first thought is this should somehow be linked to memory registration, since that's ultimately where the provider is making decisions on how to perform a data transfer. And the above configuration settings are basically letting the provider know what sort of optimizations it can perform when a data buffer is located in hmem.
Maybe there should be a more involved set of APIs to query and configure memory registration related functionality. I can see the MR cache being part of this.
I want to avoid an embedded linked list. It adds complexity, and it's unlikely that a system will realistically have more than 1 type of hmem installed anyway.
We were debating that internally. If we are confident there won't be more than one hmem type, we can get rid of the linked list structure. We started from such an implementation for flexibility with multiple hmem ifaces, and we are open to feedback on that.
We prefer to add it to fi_info for the following reasons:
- The user can see if these configurations are being used by running fi_info.
- The api_permitted field is not limited to memory registration. It is also used to prevent the application and libfabric from touching the same resources and negatively impacting the performance.
- Adding use_p2p to fi_info can help filter out the SHM provider early and allow applications to specify a preference for P2P mode early.
These settings are configurations (restrictions) specified by the application, not the provider. They differ from the fi_info attributes in that regard.
The application may not care about the implementation details and leave these configurations unspecified; then it is up to the provider to choose whether to use them and return them in fi_info.
The values make no sense for a provider to return: REQUIRED, PREFERRED, etc. These are application restrictions on the provider implementation. If the app doesn't specify a restriction, it doesn't care what the provider does. Except that an admin might... Some restrictions don't even make sense for some providers, except to disable the provider completely.
There are a much larger set of restrictions that an application may need to set. HMEM settings, MR cache controls, use of CPU atomics, use of shared memory, eager message sizes, receive side buffering for unexpected messages... Whether these restrictions are per provider, per endpoint, per domain, or global is unknown.
struct fi_info isn't the place for this. Consider that the proposal has a linked list of these restrictions. This is input into fi_getinfo(). What is a provider supposed to do with this list? Pick the one it likes? Apply all of them? How does it resolve conflicts? What does a provider do if it uses shared memory for local communication, where a setting doesn't apply?
Conceptually, the application or an administrator is programming the provider implementation through some sort of configuration mechanism. That's typically been done using environment variables, or setopt() when done programmatically. We don't want to link all these configuration values off of fi_info.
Conceptually, the application or an administrator is programming the provider implementation through some sort of configuration mechanism. That's typically been done using environment variables, or setopt() when done programmatically. We don't want to link all these configuration values off of fi_info.
We always prefer a programmatic way to do such configuration instead of environment variables. I explained in the thread below why setopt is not an ideal place for such configuration either.
There are a much larger set of restrictions that an application may need to set. HMEM settings, MR cache controls, use of CPU atomics, use of shared memory, eager message sizes, receive side buffering for unexpected messages... Whether these restrictions are per provider, per endpoint, per domain, or global is unknown.
There are always ad hoc or provider specific configurations that you don't want to involve in an API. What we are trying to achieve is to address the common pain points in the FI_HMEM interface that all providers may share, by introducing incremental changes to the interface.
Among the 3 attributes we are introducing, use_p2p and api_permitted are something AWS and HPE (@iziemba correct me if wrong) have shown interest in. use_dev_reg_copy is a data transfer detail that we can totally cut.
@@ -465,6 +481,14 @@ struct fi_fabric_attr {
uint32_t api_version;
};

struct fi_hmem_attr {
enum fi_hmem_iface iface;
enum fi_hmem_attr_opt api_permitted;
This is acting as a bool.
These fields will be both input and output. Since a boolean value of false could mean either false or unspecified, FI_HMEM_ATTR_UNSPEC is added to differentiate them.
Either the provider can call the GPU API or not. There's not a third state here.
These settings are dictating to the provider how it must implement data transfers to/from GPU buffers. For some providers, it means that the provider cannot support GPU buffers at all. (There is no PCI peer to peer support for TCP or shmem.) There's a significant difference between these variables and other attribute values.
it means that the provider cannot support GPU buffers at all. (There is no PCI peer to peer support for TCP or shmem.)
I think that is part of our goal for putting it in fi_info, because it can help with provider filtering. For a provider that doesn't support PCIe peer-to-peer, like shmem, we can use such a configuration to filter it out early in fi_info. You remember we have had challenges toggling shmem on/off inside the EFA provider without using environment variables. You believed shm usage is a data transfer level decision, so making it an ep level setopt() makes more sense, and the option could be either a general FI_OPT_SHARED_MEMORY_PERMITTED or FI_OPT_HMEM_P2P. We currently use FI_OPT_SHARED_MEMORY_PERMITTED, but such a toggle at the ep level is still too late for us because we have created the shm info/domain/av/cq/mr earlier. Cleaning all of them up at an ep call is troublesome and error-prone.
Making such a toggle as early as the fi_info level would resolve this challenge.
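For reference, a minimal sketch of the existing endpoint-level toggle (FI_OPT_SHARED_MEMORY_PERMITTED is current API); by the time it can be called, fi_getinfo()/fi_domain()/fi_endpoint() have already run, which is the timing problem described above:

```c
#include <stdbool.h>
#include <rdma/fi_endpoint.h>

/* Current model: shm usage is toggled per endpoint, after the endpoint
 * (and, in EFA's case, the shm info/domain/av/cq/mr) already exists. */
static int disable_shm(struct fid_ep *ep)
{
	bool permitted = false;

	return fi_setopt(&ep->fid, FI_OPT_ENDPOINT,
			 FI_OPT_SHARED_MEMORY_PERMITTED,
			 &permitted, sizeof(permitted));
}
```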
There's a significant difference between these variables and other attribute values.
I agree on this point. If possible, we can consider alternatives to move them to appropriate attribute groups (like domain/tx/rx/ep_attr) separately.
I agree having this information up front is useful. I disagree that these are extended attributes. They're something else. Environment variables are the closest thing to what this intends to capture.
Imagine 2 upper libraries (e.g. MPI and NCCL) calling libfabric. These libraries could drive a provider in different directions. NCCL says "no, don't use CUDA", but MPI says "go ahead and use CUDA". That doesn't work. There are global settings at play here, not per domain or endpoint settings. These settings may not even be per provider. NCCL might be using ucx, but MPI using verbs, yet the restrictions from NCCL need to carry over to both providers.
These libraries could drive a provider in different directions. NCCL says "no, don't use CUDA", but MPI says "go ahead and use CUDA". That doesn't work.
This is exactly the problem we are solving. Considering resource management and other factors, making Libfabric use cuda calls in both the control and data transfer interfaces for a NCCL application may cause unexpected risks and overhead. But we don't have such a concern for an MPI application.
NCCL might be using ucx, but MPI verbs, yet the restrictions from NCCL need to carry over both providers.
Can you explain to me the software stack here? I know NCCL can use UCX/OFI for network offload via the plugins. It can also use MPI (via a NCCL-MPI plugin ?) for the same purpose?
I learned NCCL is already warning users that using NCCL and CUDA aware MPI together is not safe: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/mpi.html#inter-gpu-communication-with-cuda-aware-mpi
I'm referring to a single process using BOTH NCCL and MPI. Or a single process accessing libfabric through more than 1 middleware (NCCL and DAOS, MPI and DAOS, etc.). The point is that these settings aren't per domain or per provider but are global to the process. That is, they are settings which apply to the environment as a whole.
NCCL doesn't like users to use CUDA aware MPI calls in the same process because it's already not safe. So I don't see a reason we cannot let MPI allow libfabric to use CUDA while NCCL doesn't allow it?
FI_HMEM_SYNAPSEAI,
};

enum fi_hmem_attr_opt {
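The hunk is cut off here; based on the values mentioned elsewhere in this discussion (UNSPEC, REQUIRED, PREFERRED), the proposed option enum is roughly of this shape (names are approximations, not necessarily those in the patch):

```c
/* Approximate shape of the proposed tri-state values; UNSPEC lets a field
 * that is both input and output distinguish "false" from "not specified". */
enum fi_hmem_attr_opt {
	FI_HMEM_ATTR_UNSPEC,
	FI_HMEM_ATTR_REQUIRED,
	FI_HMEM_ATTR_PREFERRED,
	FI_HMEM_ATTR_DISABLED,
};
```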
You use the word 'opt' here because there are setopt calls to change some of these values. But these strike me more as configuration settings than attributes. See below.
Viewing the hmem settings as a global environment configuration, the best option I can think of relates to fi_getparams(), fi_freeparams(), and fi_param_get(). Introducing fi_param_set() would allow a user to configure libfabric prior to calling fi_getinfo(), and the results would be global. This does mean that providers, including DL builds, need to be updated to check the values and adjust their implementation accordingly. The libfabric core may need to handle fi_param_set() being called prior to fi_param_define(); I'm uncertain on this. Providers will need to read the parameter, and probably cache the result, after fi_getinfo() has been called for the first time, so the app has time to call fi_param_set().
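A hypothetical sketch of that flow; fi_param_set() does not exist today, and both its signature and the parameter name below are invented for illustration:

```c
#include <rdma/fabric.h>

/* Hypothetical: fi_param_set() is the API proposed in this comment, not
 * something libfabric provides today; the parameter name is illustrative.
 * The idea is to set a global value before the first fi_getinfo() call,
 * just as an administrator would export an environment variable. */
static int restrict_hmem_before_getinfo(void)
{
	return fi_param_set(NULL, "hmem_cuda_api_permitted", "0");
}
```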
@shefty How is fi_param_set() different from setting the environment variable directly?
fi_param_set() behaves the same as setting an environment variable directly. I doubt it ends up being any different than calling setenv(). The app will want to make the call to constrain the provider and allow things to work correctly. This isn't something libfabric forces on the app; it's the other way around. Again, fi_info configures an instance of some object. That's not what's needed here. CUDA requires a global configuration.
If it works the same as setting the environment variable, why do we need to add fi_param_set()?
We can advise applications which want to set such params programmatically to use setenv().
@jiaxiyan setting an environment variable via setenv() is documented as not thread-safe. @shefty would it be acceptable to advise applications to call setenv() before fi_getinfo(), before other threads are running?
@shijin-aws - Yes, I think that would work. I would usually/always defer to using an administrative setting (e.g. export) in case of conflicts. I don't know what the actual limitations are regarding multi-thread support with setenv(), except that it's documented as not being thread safe like you mentioned.
@shefty Can we make fi_param_set() require that the variable has already been defined by fi_param_define()?
I'm not following the question. Generically, I think the code just needs to handle fi_param_set() being called prior to whatever variable it's trying to modify may be defined. Consider that provider specific variables may be defined late, maybe not until the app calls fi_getinfo(). I'm picturing fi_param_set() aligning with fi_param_get(). To add confusion, there's another option for setting 'complex' configuration values. See fi_open(). That call could be extended to support "hmem". A user could do something like this:
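Roughly (fi_open() is existing API; "hmem" as an openable name and the configuration reachable through the returned fid are hypothetical here):

```c
#include <rdma/fabric.h>

/* "hmem" as an openable name is hypothetical; fi_open() itself is the
 * existing entry point used to import the mr_cache and logging objects. */
static int open_hmem_config(struct fid **hmem_fid)
{
	int ret;

	ret = fi_open(fi_version(), "hmem", NULL, 0, 0, hmem_fid, NULL);
	if (ret)
		return ret;

	/* Adjust the global hmem settings through the returned fid's ops
	 * (including custom ops), then fi_close(*hmem_fid) when done. */
	return 0;
}
```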
(The syntax above is only approximate.) hmem_fid would be a pointer to internal, global hmem settings. Once the user has the fid, we have the full range of the API, including custom ops. fi_open() is what's used to replace the mr_cache and logging components.
PyTorch has solved all its problems with threads, so during initialization there's a ton of PyTorch threads running around calling getenv(). Calling setenv() during that time can cause the environ pointer to be copied / freed, such that getenv() is walking down a since-freed environ array. And then there is the inevitable segfault. This is not theoretical; it is significantly worse on platforms like Trainium, where the default execution mode is 1 process with a thread group per accelerator compared to Nvidia's default of 1 process per accelerator, but it can happen any time the upper layer allows multiple threads during network initialization.
@shefty Thinking about it more, I agree with your statement. Most of the configurations we would like to introduce in this PR are already covered by the existing fi_setopt option names. I think the hmem related options can definitely be set at the domain level, but we don't have a setopt/getopt for domain. Do you think we can extend fi_setopt/fi_getopt to query/set configurations at both the endpoint and domain levels, or should we introduce a domain specific setopt/getopt? I am slightly inclined toward the latter, since the existing option level enum only covers the endpoint level, and passing that enum to an ep-level-only call is awkward.
Extending setopt to cover the domain isn't a problem. I don't have an issue with that. But a domain level option is still an option set on the instance of some object. AFAICT, some of the hmem configuration settings need to be global. Can it realistically work if one domain uses CUDA calls, but another cannot? Again, consider the case where a single process accesses libfabric through multiple middleware libraries (e.g. NCCL, MPI, Mercury). Assuming we're not dealing with statically linked libraries (environment variables are the only option there), if one of the middleware components needs to disable CUDA, that needs to apply to all components. It was probably a mistake to introduce FI_OPT_CUDA_API_PERMITTED, unless the app has some sort of weird way of dedicating GPUs to NICs that somehow makes this work.
I don't think disabling CUDA is necessarily a global setting. For example, NCCL doesn't allow Libfabric to do cudaMemcpy to move data between the bounce buffer and the cuda buffer, because that cuda buffer (passed by NCCL) may already be in a cudaStream and NCCL doesn't want the network library to mess it up. Open MPI AFAIK doesn't do such stream things and it should permit cuda API calls inside Libfabric. Such different behavior depends on how the applications (MPI and NCCL here) manage their cuda buffers.

A similar example is cuPointerSetAttribute, which is usually used to set some flags on a cuda pointer, e.g. to enforce a synchronous memcpy from/to such a cuda pointer. Libfabric or other communication libraries need to set such an attribute on the passed-in cuda buffer to ensure data integrity. However, this attribute isn't supported on memory that was allocated using the VMM APIs (aka CUDA driver APIs). NCCL was using such cuda driver APIs to allocate the buffers passed to Libfabric, so Libfabric cannot set the attribute.

Such discrepancy in CUDA memory management between different middlewares makes me believe this cuda api permitted setting shouldn't be global. It may be special for Nvidia because of its complicated API family and history, but having such an attribute set via the domain or another libfabric resource gives us enough flexibility to handle similar issues in the future.
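For reference, the call being described is roughly this; a minimal sketch using the CUDA driver API:

```c
#include <cuda.h>

/* Minimal sketch: request synchronous memory operations on a device buffer
 * before using it for communication.  As noted above, this fails for memory
 * allocated through the CUDA VMM / driver virtual-memory APIs. */
static CUresult enforce_sync_memops(CUdeviceptr buf)
{
	int flag = 1;

	return cuPointerSetAttribute(&flag, CU_POINTER_ATTRIBUTE_SYNC_MEMOPS, buf);
}
```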
A domain setopt is only marginally different than an endpoint setopt, which already exists. If anything, an endpoint setopt is more flexible.

The larger problem is avoiding circular dependencies between the compute API and communication API. If a compute kernel is waiting on network data, that forms a compute -> communication dependency. If the communication library calls the compute API to complete the transfer, that can create a reverse communication -> compute dependency. Note that it doesn't matter if the compute API (e.g. CUDA) is called directly by the app or through middleware (e.g. NCCL), and the same is true of the communication API (e.g. direct calls to libfabric, libfabric via NCCL, libfabric via MPI).

It's possible such circular dependencies won't result in deadlock, but it's difficult to determine. For an app using middleware (NCCL, MPI), the app may not be able to guarantee it won't. Avoiding deadlock in this case may require a big hammer that disables all communication -> compute dependencies. AFAIK, a big hammer is required by NCCL, but it would also work for all other cases, so I think we should design this control library wide. There's already a more refined option available. (I just doubt it can realistically work in practice.)